METHODS AND APPARATUS FOR SINGLE STAGE GALOIS FIELD OPERATIONS

Field of the Invention 

The present invention relates generally to improvements to digital signal processing and 
more particularly to advantageous methods and apparatus for providing improved Galois field 
operations. 

Background of the Invention 

The operations of addition and multiplication utilizing Galois field (GF) arithmetic are
very different from the usual multiply and add instructions in digital signal processors (DSPs).
Specialized instructions are therefore typically needed to perform the computations in a
reasonable amount of time. The specialized instructions specify the inputs, the result destination
and the type of GF operation to be executed. A GF multiplication operation of two input
elements is an important function which signal processing units and DSPs may need to perform.
In considering GF operations, there are at least two different ways to encode the elements of a
GF: 1) using the polynomial coefficients as a vector of bits, or 2) using the exponent form. Each
of these encodings makes one of the two operations easy to calculate, but the other more
complex. For example, GF addition utilizing the polynomial coefficients
approach is an exclusive or (XOR), while a multiplication of two elements in exponent form is
an addition of the exponents. However, the multiplication operation, utilizing the polynomial
coefficient form, and the addition operation, utilizing the exponent form, are typically both
more complex to implement.

Further details of several prior art approaches are found in the following patents: "Galois
Field Computer," U.S. Patent No. 4,162,480, "Multiplier in a Galois Field," U.S. Patent No.
4,918,638, and "Galois Field Arithmetic Apparatus and Method," U.S. Patent No. 6,134,572.
The first patent describes a table lookup for the GF multiplication of GF(2^). The second patent
uses two function stages to calculate a GF multiplication utilizing a binary multiplier array for a
first function stage and a polynomial reducer for the second function stage. The third patent uses
the exponent representation form.

Summary of the Invention

Galois fields (GF) and the multiplication operation in such fields have many applications
in communication systems. Some examples are their utilization for error detection and/or
correction, for cryptography, and the like. Due to the special meaning of GF multiplication,
however, standard signal processors typically are inefficient in performing such a computation.
It is therefore important to consider techniques and designs to efficiently compute the operations
needed in a signal processor, such as a DSP or a fixed function signal processor, over different
Galois extension fields and generator polynomials. The present invention advantageously
calculates the GF multiplication in polynomial coefficients form as a single function stage
calculation by merging two function stages into a single new function stage. The new single
function stage GF(2^m) multiplication further advantageously uses an m-by-m single function
stage calculation array utilizing only m bits per internal logic stage as compared with the
previous two function stage approach of U.S. Patent No. 4,918,638, which calculated 2m-1 bits
from the first stage multiplication array as inputs to the second stage polynomial reducer. The
present invention provides a savings of m-1 bits per logic stage that do not need to be accounted
for in the internal array implementation. One regular m-by-m array in accordance with the
invention may be constructed by replicating a common cell circuit design allowing for further
optimizations, for example, using custom logic design techniques to produce a common cell that
is of higher performance and reduced area. In addition, the GF multiplication array can be
physically instantiated multiple times and used by DSP software programs with a specialized
instruction to perform multiple GF multiplications on multiple data elements in a packed data
format. For parallel DSPs with multiple processing elements (PEs), the specialized instruction
can be used in programs to perform the packed data format GF multiplications on the multiple
PEs in parallel.

These and other advantages and aspects of the present invention will be apparent from the 
drawings and the Detailed Description which follow below. 

Brief Description of the Drawings 

Fig. 1A illustrates a first function stage for computing polynomial multiplication terms in
a traditional two function stage GF multiplication approach showing exemplary calculation
traces for two polynomial multiplications;

Figs. 1B and 1C illustrate the first function stage polynomial multiplications as if done by
hand for the two examples of Fig. 1A;

Fig. 2 illustrates a second function stage for performing polynomial division in a
traditional two function stage approach showing two reductions of the two dividends from Fig.
1A to generate two remainders which are the result of the GF multiplications;

Fig. 3A illustrates a first logic stage with i=1 calculating Y(1) for m=3 in accordance
with the present invention;

Fig. 3B illustrates first and second logic stages with i=2 calculating Y(2) for m=3 in
accordance with the present invention;

Fig. 3C illustrates first, second and third logic stages with i=3 calculating Y(3), the GF
multiplication result, for m=3 in accordance with the present invention;

Fig. 4 illustrates a GF multiplication cell for construction of an m-by-m GF multiplication
array in accordance with the present invention;

Fig. 5 illustrates a single m=8 GF multiplication unit which may suitably be used in a
ManArray architecture processor in accordance with the present invention;

Fig. 6 illustrates a Manta-type processor, a subset of the ManArray architecture, which
may be suitably adapted for use in conjunction with the present invention;

Fig. 7A illustrates an exemplary encoding format for a packed data finite field multiply
instruction (MPYGF) in accordance with the present invention; and

Fig. 7B shows a syntax/operation table for the MPYGF instruction of Fig. 7A.

Detailed Description

To provide context for a hardware implementation, and instructions and software
techniques for GF multiplication in accordance with the present invention, a brief description of
GF arithmetic follows below. A field F(S,+,*,0,1) defines an algebraic entity consisting of a set
of elements S, two arithmetic operations, addition and multiplication, denoted by the symbols +
and *, respectively, closed over S, and the corresponding identity elements for these operations,
denoted by 0 and 1, respectively. A Galois field is a defined subset of a general field which is a
set, such as the set of rational numbers, that can be manipulated by the mathematical operations
of addition and multiplication. A Galois field is represented as GF(q) where q represents the
finite number of elements of the field, for this case the integers {0, 1, ..., q-1}. For example, let
q=2 such that GF(2) has elements {0,1}. An extension field of interest is GF(2^m) where m is a
positive integer. This Galois field has 2^m elements of the vector space of dimension m over
GF(2). It has been proven that every finite field is isomorphic to a Galois field. Also, any field
GF(2^m) can be generated using a primitive polynomial P of degree m over GF(2), and the
arithmetic performed in the GF(2^m) field is modulo this primitive polynomial.

A traditional implementation of GF multiplication consists of two function stages, which
perform the GF multiplication much as it would be computed by hand. For further exemplary
details of such an implementation, see U.S. Patent No. 4,918,638. During a first function stage, a
polynomial multiplication is implemented, where the polynomials have coefficients from GF[2].
This operation is a modification of a typical multiplier where the carries are not calculated or
propagated. In the second function stage, the corresponding polynomial division by the
generator polynomial is implemented where the remainder is calculated as discussed further
below.

The GF multiplication can be derived as follows: let g[x] be the generator polynomial of
GF[2^m]. Let polynomials p[x] and q[x] be members of GF[2^m] / g[x]. The GF multiplication is
the calculation of the remainder of the polynomial division of the product p[x]*q[x] divided by
g[x] in GF[2] arithmetic. It is noted that a polynomial with coefficients in GF[2] multiplied by
the monomial x^m is equivalent to shifting the bit vector of the polynomial coefficients to the left
m positions with zero fill in. Polynomial addition is equivalent to performing a bit wise
exclusive or (XOR) operation. Also, note that subscripts represent a specific bit position in a bit
vector, the asterisk symbol (*) represents bit ANDs, and the symbols + and ⊕ represent the XOR
operation.

Polynomial multiplication may be calculated by a recurrence relation equation 1 that
defines the polynomial S(i):

S(i) = S(i-1) + q_{m-i} * p * x^{m-i}, for i = 1...m, and where S(0) = 0          Equation 1

where S(m) is the polynomial product of p[x]*q[x]. For example, Fig. 1A shows a first stage
polynomial multiplication table 100 with equation 1 multiplication terms 102 and calculation
results 104 and 106 for a common input p=(p_4 p_3 p_2 p_1 p_0) and two different inputs
q=(q_4 q_3 q_2 q_1 q_0) for m=5. For calculation results 104, p=11101 and q=10111. For
calculation results 106, p=11101 and q=10010. Figs. 1B and 1C show multiplication operations
150 and 170 for the two examples 104 and 106, respectively, of Fig. 1A as if done by hand.
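
The Equation 1 recurrence can be sketched in software as follows. This is a minimal illustration, assuming that polynomials are held as Python integers with bit j holding the coefficient of x^j; the function name poly_mul_gf2 is hypothetical and does not appear in the figures.

```python
def poly_mul_gf2(p: int, q: int, m: int) -> int:
    """Carry-less product of two m-bit polynomials over GF(2), following Equation 1."""
    s = 0  # S(0) = 0
    for i in range(1, m + 1):
        q_bit = (q >> (m - i)) & 1      # q_{m-i}
        if q_bit:
            s ^= p << (m - i)           # add p * x^{m-i}; addition is XOR, no carries
    return s                            # S(m), at most 2m-1 bits wide


# The two m=5 examples discussed for Fig. 1A.
product_104 = poly_mul_gf2(0b11101, 0b10111, 5)
product_106 = poly_mul_gf2(0b11101, 0b10010, 5)
print(format(product_104, "09b"))  # 110000011
print(format(product_106, "09b"))  # 111101010
```

Both products occupy 2m-1 = 9 bits, which is the width the second function stage has to accept in the two stage approach.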

The polynomial remainder of the division of p[x]*q[x] divided by g[x] is calculated by
the recurrence relation equation that defines the polynomial Z(i):

Z(i) = Z(i-1) + Z(i-1)_{2m-i} * g * x^{m-i}, for i = 2...m, and where Z(1) = S(m)          Equation 2

The remainder is given by the least significant m bits of Z(m).

In one traditional approach to GF multiplication in which polynomial multiplication is
followed by a polynomial division, the result of the multiplication becomes the dividend for the
reduction operation. For example, Fig. 2 illustrates a second polynomial division stage reduction
operation 200 where remainder terms 202 and division operations 204 and 206 continue the
examples shown in Fig. 1A. The generator polynomial is g(x) = x^5 + x^2 + 1, that is, the
coefficients of the divisor are given in binary form as (100101). For the first example of a
division operation 204, the binary dividend (110000011) 210 corresponds to the polynomial
x^8 + x^7 + x + 1. In the other example of a division operation 206, the binary dividend
(111101010) 212 corresponds to the polynomial x^8 + x^7 + x^6 + x^5 + x^3 + x.

Returning to Fig. 1A, the equation 1 multiplication terms 102 with m=5 are as follows:

S(0) = 0                                                                              107

S(1) = S(0) + q_4*p*x^4 = q_4*p*x^4                                                   108

S(2) = S(1) + q_3*p*x^3 = q_4*p*x^4 + q_3*p*x^3                                       109

S(3) = S(2) + q_2*p*x^2 = q_4*p*x^4 + q_3*p*x^3 + q_2*p*x^2                           110

S(4) = S(3) + q_1*p*x^1 = q_4*p*x^4 + q_3*p*x^3 + q_2*p*x^2 + q_1*p*x^1               111

S(m) = S(5) = S(4) + q_0*p*x^0 = q_4*p*x^4 + q_3*p*x^3 + q_2*p*x^2 + q_1*p*x^1 + q_0*p*x^0     112

The exemplary multiplication operations shown in Figs. 1B and 1C include reference numbers
corresponding to those in the calculation result columns 104 and 106 of Fig. 1A. Note that
equation 1 generates 2m-1 bits, as can be seen for the result S(5) 112 and the examples 150 and
170. Thus, the results 135 and 136 have 2(5)-1 or 9 bits. It is further noted that for GF
multiplication, the carries are not calculated or propagated.

Next equation 2 is discussed with reference numbers included in the equation steps
corresponding to the rows of the exemplary calculations shown in Fig. 2. Fig. 2 shows
remainder terms 202, calculation results 204 for p=11101, q=10111 and g=100101, and
calculation results 206 for p=11101, q=10010 and g=100101. For m=5 and i=1, the first term
Z(1) is defined as:

Z(1) = S(m) = S(5)                         208

For i=2, the next term Z(2) is:

Z(2) = Z(1) + Z(1)_8*g*x^3                 214

For i=3, the next term Z(3) is:

Z(3) = Z(2) + Z(2)_7*g*x^2                 220

For i=4, the next term Z(4) is:

Z(4) = Z(3) + Z(3)_6*g*x^1                 226

For i=5, the next term Z(5) is:

Z(5) = Z(4) + Z(4)_5*g*x^0                 232

For calculation results of the division operations 204 and 206 of Fig. 2, equation 2
specifies the remainders, and thus the GF products 240 and 242, which are (11010) and (11100),
respectively. The first GF product 240 corresponds to x^4 + x^3 + x and the second GF product 242
corresponds to x^4 + x^3 + x^2. Note that the calculation of equation 2 requires the use of 2m-1 bits as
shown in each calculation step in the table 200 calculation results columns 204 and 206.
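
The reduction of Equation 2 can be sketched in the same way. This is an illustrative model only, assuming integers whose bit j is the coefficient of x^j and a generator g given with its x^m coefficient included, as in the binary form (100101); the name poly_reduce_gf2 is hypothetical.

```python
def poly_reduce_gf2(s: int, g: int, m: int) -> int:
    """Remainder of s (up to 2m-1 bits) divided by g (degree m), following Equation 2."""
    z = s  # Z(1) = S(m)
    for i in range(2, m + 1):
        if (z >> (2 * m - i)) & 1:      # Z(i-1)_{2m-i}
            z ^= g << (m - i)           # add g * x^{m-i}, clearing bit 2m-i
    return z & ((1 << m) - 1)           # least significant m bits of Z(m)


G = 0b100101  # g(x) = x^5 + x^2 + 1
remainder_240 = poly_reduce_gf2(0b110000011, G, 5)
remainder_242 = poly_reduce_gf2(0b111101010, G, 5)
print(format(remainder_240, "05b"))  # 11010
print(format(remainder_242, "05b"))  # 11100
```

Every intermediate value z is 2m-1 bits wide, which is the cost the single function stage formulation is intended to avoid.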

With the above discussion of Figs. 1A, 1B and 2 as background, it is next shown how the
present invention computes the GF multiplication in a single new function stage. In the approach
of the present invention, equation 2 is expanded by incorporating the input Z(1)=S(m) terms and
combining the terms in the equation in such a way as to create a new and different recurrence
relation that represents a single function stage calculation of a GF(2^m) multiplication.

For m=5 and i=1, the first term Z(1) is defined as:

Z(1) = S(m) = S(5) = q_4*p*x^4 + q_3*p*x^3 + q_2*p*x^2 + q_1*p*x^1 + q_0*p*x^0

Using equation 2 and i=2, the next term Z(2) can be written as:

Z(2) = Z(1) + Z(1)_8*g*x^3

Substituting for Z(1) leads to:

Z(2) = q_4*p*x^4 + q_3*p*x^3 + q_2*p*x^2 + q_1*p*x^1 + q_0*p*x^0 + Z(1)_8*g*x^3

Combining the common x^3 terms of the input S(5), q_3*p*x^3, with the division recurrence term,
Z(1)_8*g*x^3, changes the nature of the equation being evaluated as can be seen in the following
steps and discussed further below. The Z(2) term using combined input terms is:

Z(2) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + q_2*p*x^2 + q_1*p*x^1 + q_0*p*x^0

Using equation 2 and i=3, the next term Z(3) can be written as:

Z(3) = Z(2) + Z(2)_7*g*x^2

Substituting for Z(2) leads to:

Z(3) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + q_2*p*x^2 + q_1*p*x^1 + q_0*p*x^0 + Z(2)_7*g*x^2

Combining the common x^2 terms of the input S(5), q_2*p*x^2, with the division recurrence term,
Z(2)_7*g*x^2, yields:

Z(3) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + (q_2*p + Z(2)_7*g)*x^2 + q_1*p*x^1 + q_0*p*x^0

Using equation 2 and i=4, the next term Z(4) can be written as:

Z(4) = Z(3) + Z(3)_6*g*x^1

Substituting for Z(3) leads to:

Z(4) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + (q_2*p + Z(2)_7*g)*x^2 + q_1*p*x^1 + q_0*p*x^0
       + Z(3)_6*g*x^1

Combining the common x^1 terms of the input S(5), q_1*p*x^1, with the division recurrence term,
Z(3)_6*g*x^1, yields:

Z(4) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + (q_2*p + Z(2)_7*g)*x^2 + (q_1*p + Z(3)_6*g)*x^1
       + q_0*p*x^0

Using equation 2 and i=5, the next term Z(5) can be written as:

Z(5) = Z(4) + Z(4)_5*g*x^0

Substituting for Z(4) leads to:

Z(5) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + (q_2*p + Z(2)_7*g)*x^2 + (q_1*p + Z(3)_6*g)*x^1
       + q_0*p*x^0 + Z(4)_5*g*x^0

Combining the common x^0 terms of the input S(5), q_0*p*x^0, with the division recurrence term,
Z(4)_5*g*x^0, yields:

Z(5) = q_4*p*x^4 + (q_3*p + Z(1)_8*g)*x^3 + (q_2*p + Z(2)_7*g)*x^2 + (q_1*p + Z(3)_6*g)*x^1
       + (q_0*p + Z(4)_5*g)*x^0                                                   Equation 3

Due to the combining of the common input terms with the division recurrence terms,
equation 3 represents a new recurrence relation that can be written in a general form as follows:

Y(i) = Y(i-1) + (q_{m-i}*p + Y(i-1)_{2m-i}*g)*x^{m-i}, i = 1, 2, ..., m and where Y(0) = 0     Equation 4

Note that Y(m) = Z(m) but, because of the combining of the terms as described above, Y(a) ≠
Z(a) when a<m. The intermediate terms Y(a) are not output results and the final answer Y(m) is
all that is needed. Equation 4 can be implemented as an advantageous merging of the two
previously separate function stages into a single function stage that is represented by the new
recurrence relation equation 4. As indicated by Equation 4, there is no inherent algorithmic
limitation to handling arbitrary values of m with the approach of the present invention.
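
As an informal check of Equation 4, the recurrence can be evaluated directly on integers. The sketch below assumes bit j of an integer is the coefficient of x^j and that g includes its x^m coefficient; the name gf_mul_eq4 is hypothetical, and the code is an illustration rather than the hardware implementation.

```python
def gf_mul_eq4(p: int, q: int, g: int, m: int) -> int:
    """Evaluate Y(m) from Equation 4, merging multiplication and reduction."""
    y = 0  # Y(0) = 0
    for i in range(1, m + 1):
        q_bit = (q >> (m - i)) & 1              # q_{m-i}
        y_bit = (y >> (2 * m - i)) & 1          # Y(i-1)_{2m-i}
        term = (p if q_bit else 0) ^ (g if y_bit else 0)
        y ^= term << (m - i)                    # add (q_{m-i}*p + Y(i-1)_{2m-i}*g) * x^{m-i}
    return y                                    # Y(m), equal to Z(m)


G = 0b100101  # g(x) = x^5 + x^2 + 1
y_first = gf_mul_eq4(0b11101, 0b10111, G, 5)
y_second = gf_mul_eq4(0b11101, 0b10010, G, 5)
print(format(y_first, "05b"))   # 11010
print(format(y_second, "05b"))  # 11100
```

The two results agree with the GF products 240 and 242 obtained by the two function stage approach for the same inputs.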

Equation 4 can be simplified further by examining the Y(i-1)_{2m-i} term by first specifying i
as i-1 and then substituting i-1 into Equation 4 for the purpose of determining the 2m-i bit:

Y(i-1)_{2m-i} = (Y(i-2) + (q_{m-(i-1)}*p + Y(i-2)_{2m-(i-1)}*g)*x^{m-(i-1)})_{2m-i}

which is valid for i = 2, 3, ..., m and where Y(0) = 0. Because polynomial addition is
equivalent to performing a bit wise exclusive or operation with no carries calculated, a bit of a
vector exclusive or result is equal to the exclusive or of the input bits. Thus:

Y(i-1)_{2m-i} = Y(i-2)_{2m-i} + ((q_{m-(i-1)}*p + Y(i-2)_{2m-(i-1)}*g)*x^{m-(i-1)})_{2m-i}

As noted earlier, a polynomial with coefficients in GF[2] multiplied by the monomial x^m
is equivalent to shifting the bit vector of the polynomial coefficients to the left m positions with
zero fill in. Consequently, a bit represented by (A*x^{m-(i-1)})_{2m-i} is equivalent to
A_{2m-i-(m-(i-1))} = A_{m-1}. Therefore,

Y(i-1)_{2m-i} = Y(i-2)_{2m-i} + (q_{m-i+1}*p + Y(i-2)_{2m-i+1}*g)_{2m-i-(m-(i-1))}

and simplifying yields:

Y(i-1)_{2m-i} = Y(i-2)_{2m-i} + (q_{m-i+1}*p + Y(i-2)_{2m-i+1}*g)_{m-1}          Equation 5

which is valid for i = 2, 3, ..., m and where Y(0) = 0.

It can be deduced from the Equation 5 recurrence relation that Y(i-1)_{2m-i} begins with a
Y(0)_{2m-i} term, for i=2, and all other terms for i = 2, 3, ..., m are each the m-1 bit of a previous
calculation. Consequently, since Y(0)_{2m-i} = Y(0)_{m-1} = 0, Y(i-1)_{2m-i} can be stated as:

Y(i-1)_{2m-i} = Y(i-1)_{m-1} for i = 2, 3, ..., m and where Y(0) = 0          Equation 6

Equation 4 can then be rewritten using Equation 6 as follows:

Y(i) = Y(i-1) + (q_{m-i}*p + Y(i-1)_{m-1}*g)*x^{m-i}, i = 1, 2, ..., m and where Y(0) = 0     Equation 7

Equation 7 has been verified by a C program routine for polynomials up to degree 8, or
equivalently, elements of GF(2^m) for m less than or equal to 8, but there does not appear to be
any inherent algorithmic limitation to handling arbitrary values of m. By way of example, a
hardware implementation of equation 7 is described below in conjunction with the discussion of
Figs. 3A, 3B and 3C.
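
A verification of this kind can be sketched as an exhaustive comparison of the two function stage calculation against an m-bit form of Equation 7. The sketch is written in Python rather than C and is not the routine referred to above. It assumes the reading suggested by the array description of Figs. 3A-3C and 4: each stage keeps only m bits, the previous result enters shifted one position to the left, and only the coefficients g_{m-1} ... g_0 of the generator are applied. The function names and the generator polynomials chosen for each m are illustrative assumptions.

```python
def gf_mul_two_stage(p: int, q: int, g: int, m: int) -> int:
    """Reference: Equation 1 (multiply) followed by Equation 2 (reduce)."""
    s = 0
    for i in range(1, m + 1):
        if (q >> (m - i)) & 1:
            s ^= p << (m - i)
    for i in range(2, m + 1):
        if (s >> (2 * m - i)) & 1:
            s ^= g << (m - i)
    return s


def gf_mul_single_stage(p: int, q: int, g: int, m: int) -> int:
    """m-bit single stage form: one shift, two ANDs and one XOR per stage."""
    mask = (1 << m) - 1
    g_low = g & mask                        # g_{m-1} ... g_0 (x^m term is implicit)
    y = 0                                   # Y(0) = 0
    for i in range(1, m + 1):
        msb = (y >> (m - 1)) & 1            # Y(i-1)_{m-1}
        q_bit = (q >> (m - i)) & 1          # q_{m-i}
        y = ((y << 1) & mask) ^ (p if q_bit else 0) ^ (g_low if msb else 0)
    return y


# Illustrative generator polynomials of degree m (assumed, one per field size).
GENERATORS = {3: 0b1011, 4: 0b10011, 5: 0b100101, 8: 0b100011101}

mismatches = [
    (m, p, q)
    for m, g in GENERATORS.items()
    for p in range(1 << m)
    for q in range(1 << m)
    if gf_mul_single_stage(p, q, g, m) != gf_mul_two_stage(p, q, g, m)
]
print(len(mismatches))  # 0
```

No intermediate value in gf_mul_single_stage exceeds m bits, whereas the reference routine carries 2m-1 bits between its two loops.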

Equation 7 for m=3 becomes:

Y(i) = Y(i-1) + (q_{3-i}*p + Y(i-1)_2*g)*x^{3-i}, i = 1, 2, 3 and where Y(0) = 0          Equation 8

For i=1:

Y(1) = Y(0) + (q_2*p + Y(0)_2*g)*x^2 = 0 + (q_2*p + 0*g)*x^2          Equation 9

Equation 9 can be implemented in an exemplary circuit 300 as shown in Fig. 3A. In Figs.
3A, 3B, 3C, 4, and 5, AND gates are represented by a hexagon with an "a" symbol and exclusive or
(XOR) gates are represented by a hexagon with a ⊕ symbol. In Fig. 3A, the term q_2*p is
generated by AND gates 302-304 and the term 0*g is generated by AND gates 305-307. The
two sum (+) exclusive ors of equation 9 are combined in the three input XOR gates 309-311.
The components of the Y(0) term of Equation 9 are equal to zero by definition of equation 8 and,
consequently, zero values are applied to the inputs 315 and 316. The p inputs (p_2, p_1, and p_0) are
applied to inputs 317, 318 and 319, respectively, and the generator polynomial coefficients g
inputs (g_2, g_1, and g_0) are applied to inputs 320, 321 and 322, respectively. The input Y(0)_2=0 is
provided on input 323, a zero is provided on input 324 being at the edge of the array, and q_2 is
provided on input 325. The results Y(1)_2, Y(1)_1, and Y(1)_0 of circuit 300 appear at outputs 326,
327, and 328, respectively. To allow the use of a common cell for implementing the array, a
third XOR input on border cells, such as input 324 of cell 329, is set to zero.

Continuing for i=2:

Y(2) = Y(1) + (q_1*p + Y(1)_2*g)*x^1 = (q_2*p + 0*g)*x^2 + (q_1*p + Y(1)_2*g)*x^1          Equation 10

Equation 10 can be implemented in an exemplary circuit 330 as shown in Fig. 3B. In Fig. 3B,
the term q_1*p = q_1 331 * (p_2 317 p_1 318 p_0 319) is generated by AND gates 332-336 and the term
Y(1)_2*g = Y(1)_2 326 * (g_2 320 g_1 321 g_0 322) is generated by AND gates 340-344. The two
required sum (+) exclusive ors are combined in the three input XOR gates 350-354. Note that
due to the x^2 and x^1 terms in Equation 10, there is a shift of 1 bit in the exclusive or inputs
accounting for paths 327 and 328 for Y(1)_1 and Y(1)_0, respectively. Border cell 355 has its
XOR third input 356 set to zero. The results of circuit 330 are outputs Y(2)_2 357, Y(2)_1 358, and
Y(2)_0 359.

Continuing for i=3:

Y(3) = Y(2) + (q_0*p + Y(2)_2*g)*x^0

Y(3) = (q_2*p + 0*g)*x^2 + (q_1*p + Y(1)_2*g)*x^1 + (q_0*p + Y(2)_2*g)*x^0          Equation 11

Equation 11 can be implemented in an exemplary circuit 360 shown in Fig. 3C where the term
q_0*p = q_0 361 * (p_2 317 p_1 318 p_0 319) is generated by AND gates 372-376 and the term Y(2)_2*g
= Y(2)_2 357 * (g_2 320 g_1 321 g_0 322) is generated by AND gates 380-384. The two required sum
(+) exclusive ors are combined in the three input XOR gates 390-392. Note that due to the x^0
term in Equation 11 there is a shift of 1 bit in the exclusive or inputs accounting for paths 358
and 359 for Y(2)_1 and Y(2)_0, respectively. Border cell 395 has its XOR third input 396 set to
zero. The results of circuit 360 are outputs Y(3)_2 397, Y(3)_1 398, and Y(3)_0 399.

It is noted that the above described implementation of equation 7 requires only m bits for
each logic stage, as shown in the exemplary circuits of Figs. 3A, 3B, and 3C. By contrast, the
previous calculation techniques for the examples of Figs. 1A-1C and Fig. 2 required 2m-1 bits
per logic stage. This reduction represents a savings of m-1 bits per internal logic stage that do
not have to be accounted for in an implementation.

Fig. 4 illustrates a GF multiplication cell 400 where cell output bit Y(i)_j 402 depends on
the most significant bit of the previous calculation Y(i-1)_{m-1} 404, the value of its right neighbor
bit Y(i-1)_{j-1} 406 the result of the previous calculation, bit q_{m-i} 408, bit p_j 410, and bit g_j 412.
Internal to the GF multiplication cell 400 are two 2-input AND gates 414 and 416 and a 3-input
XOR gate 418. The three logic gates are connected based on Equation 7, repeated here for easy
reference to the logic gates of Fig. 4, Y(i) = Y(i-1) + (q_{m-i}*p + Y(i-1)_{m-1}*g)*x^{m-i}. The
q_{m-i}*p AND is accomplished for bit position j by AND gate 414, the Y(i-1)_{m-1}*g AND for
bit position j is accomplished by the AND gate 416, and the XOR of these two AND results is
accomplished by XOR gate 418. The third input 406 to the XOR gate 418 is for the Y(i-1)_{j-1}
term, which, due to the x^{m-i} term of equation 7, has a shift of one bit between the previous Y(i-1)
value and the (q_{m-i}*p + Y(i-1)_{m-1}*g) value. This one-bit shift is accomplished for the bit position
Y(i)_j 402 by XOR 418 having its third input being the previous Y term shifted by 1 bit, in other
words bit Y(i-1)_{j-1} 406.

Note that for border cells on the rightmost edge of a GF multiplication array, for
example, cells 329, 355 and 395 as shown in Fig. 3C, the third XOR inputs, 324, 356, and 396,
respectively, are set to zero. The same is true for input 406 in general cell 400 when the cell is
used as a border cell on the rightmost edge of a GF multiplication array. A regular m-by-m
array is constructed by replicating a common cell circuit design, such as circuit 400 of Fig. 4,
allowing for further optimizations, for example, using custom logic design techniques to produce
a common cell that is of higher performance and reduced area.
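
The cell and its replication into an m-by-m array can be modeled at the bit level as follows. This is a behavioral sketch only, assuming bits are held in Python lists with index j holding bit j; the names gf_cell, gf_array, to_bits and from_bits are hypothetical, and the gate reference numerals appear only in comments.

```python
def gf_cell(y_msb: int, y_right: int, q_bit: int, p_bit: int, g_bit: int) -> int:
    """One GF multiplication cell: two 2-input ANDs feeding a 3-input XOR."""
    and_qp = q_bit & p_bit          # AND gate 414: q_{m-i} * p_j
    and_yg = y_msb & g_bit          # AND gate 416: Y(i-1)_{m-1} * g_j
    return and_qp ^ and_yg ^ y_right  # XOR gate 418 with Y(i-1)_{j-1}


def gf_array(p_bits, q_bits, g_bits):
    """m rows of m cells; row i consumes q_{m-i} and the previous row's outputs."""
    m = len(p_bits)
    y = [0] * m                                  # Y(0) = 0
    for i in range(1, m + 1):
        msb = y[m - 1]
        q_bit = q_bits[m - i]
        y = [
            gf_cell(msb, y[j - 1] if j > 0 else 0,  # border cell input tied to zero
                    q_bit, p_bits[j], g_bits[j])
            for j in range(m)
        ]
    return y


def to_bits(value: int, m: int):
    return [(value >> j) & 1 for j in range(m)]


def from_bits(bits) -> int:
    return sum(bit << j for j, bit in enumerate(bits))


# m=5 example with g(x) = x^5 + x^2 + 1, so (g_4 ... g_0) = 00101.
result_5 = from_bits(gf_array(to_bits(0b11101, 5), to_bits(0b10111, 5), to_bits(0b00101, 5)))
print(format(result_5, "05b"))  # 11010
```

Because every row uses the same cell with only its wiring to the previous row changing, the model mirrors the regular structure that makes a single optimized cell design reusable across the array.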

A single exemplary m=8 GF multiplication unit 500 is shown in Fig. 5. Unit 500 consists
of an m-by-m=8x8 array of the GF cells 400 shown in Fig. 4. The inputs
q = (q_7 q_6 q_5 q_4 q_3 q_2 q_1 q_0) 504, p = (p_7 p_6 p_5 p_4 p_3 p_2 p_1 p_0) 508, and
g = (g_7 g_6 g_5 g_4 g_3 g_2 g_1 g_0) 512 are provided from an
external source such as the read ports of at least two registers or a register file or memory device.
The Y(i-1)_{m-1} and the Y(i-1)_{j-1} array border GF multiplication circuit cell input values 516 and
520 are set to 0. The result Y = (Y(8)_7 Y(8)_6 Y(8)_5 Y(8)_4 Y(8)_3 Y(8)_2 Y(8)_1 Y(8)_0) output 524 is
provided to an external destination such as the write port of a register or a register file or memory
device.

The present invention computes the GF multiplication in a single stage. In one
implementation, it may be embodied as an instruction for the MANARRAY™ architecture
wherein 8 GF multipliers are incorporated in each processing element. This arrangement allows
a GF multiplication instruction, as described in more detail below, using the same generator
polynomial for each GF multiplication, to cause 8 GF multiplications to be calculated
simultaneously on each processing element by using 8 GF multiplication units 500. The GF
multiplication instruction implemented for the MANARRAY™ architecture accomplishes the 8
GF multiplications by operating on packed data of 8 bytes producing 8 results on each
processing element every cycle. With a four PE array, 32 GF multiplications can be obtained
each cycle. For reasons of programming flexibility, the GF multiplication instruction also
specifies 4 GF multiplications for operation on 4 bytes packed in 32-bit words and 8 GF
multiplications for operations on 8 bytes packed in 64-bit double words.

More specifically and in an illustrative embodiment of the present invention, an
exemplary ManArray 2x2 iVLIW single instruction multiple data stream (SIMD) processor 600,
representative of the Manta processor and mobile media processor (MMP) which are both
subsets of the ManArray architecture, as shown in Fig. 6, may be adapted as described further
below for use in conjunction with the present invention. Processor 600 comprises a sequence
processor (SP) controller combined with a processing element-0 (PE0) to form an SP/PE0
combined unit 601, as described in further detail in U.S. Patent No. 6,219,776. Three additional
PEs 651, 653, and 655 are also labeled with their matrix positions as shown in parentheses for
PE0 (PE00) 601, PE1 (PE01) 651, PE2 (PE10) 653, and PE3 (PE11) 655. The SP/PE0 601
contains an instruction fetch (I-fetch) controller 603 to allow the fetching of "short" instruction
words (SIW) or abbreviated-instruction words from a B-bit instruction memory 605, where B is
determined by the application instruction-abbreviation process to be a reduced number of bits
representing ManArray native instructions. If an instruction abbreviation apparatus is not used,
then B is determined by the SIW format.

The fetch controller 603 provides the typical functions needed in a programmable
processor, such as a program counter (PC), a branch capability, and eventpoint loop operations.
It also provides the instruction memory control which could include an instruction cache if
needed by an application. In addition, the I-fetch controller 603 controls the dispatch of
instruction words, such as a GF multiplication instruction, and instruction control information to
the other PEs in the system by means of a D-bit instruction bus 602. D is determined by the
implementation taking into account the SIW format, which for the exemplary ManArray
coprocessor is D=32 bits. The instruction bus 602 may include additional control signals as needed
to distribute instructions to the multiple processing elements.

In this exemplary system 600, common elements are used throughout to simplify the
explanation, though actual implementations are not limited to this restriction. For example, the
execution units 631 in the combined SP/PE0 601 can be separated into a set of execution units
optimized for the control functions of the SP. Fixed point execution units can be used in the SP,
while PE0 and the other PEs can be optimized for a floating point application. For the purposes
of this description, it is assumed that the execution units 631 are of the same type in the SP/PE0
and the PEs. The MAU execution units 632, 633, 634 and 635 each contain eight GF
multiplication units, each of the type 500 shown in Fig. 5, for GF multiplication instruction
execution capability. The MAUs provide the GF multiplication units with register file access
interfaces. Each of the register files contained in the SP/PE0 and the other PEs is a common
design PE configurable register file, 611, 627, 627', 627", and 627''', which is described in further
detail in U.S. Patent No. 6,343,356.

The SP/PE0 and the other PEs use a five instruction slot indirect very long instruction
word (iVLIW) architecture which contains a VLIW instruction memory (VIM) 609 and an
instruction decode and VIM controller functional unit 607 which receives instructions as
dispatched from the SP/PE0's I-fetch unit 603 and generates VIM addresses and control signals
608 required to access the iVLIWs stored in the VIM. Referenced instruction types are
identified by the letters SLAMD in VIM 609, where the letters are matched up with instruction
types as follows: Store (S), Load (L), ALU (A), MAU (M), and DSU (D). The basic concept of
loading the iVLIWs is described in further detail in U.S. Patent No. 6,151,668. A VLIW, which
may contain a GF multiplication instruction in the MAU slot position, may be indirectly
accessed from a VIM and executed upon receipt in the SP and/or PEs of an execute VLIW (XV)
SIW.

Due to the combined nature of the SP/PE0, the data memory interface controller 625
must handle the data processing needs of both the SP controller, with SP data in memory 621,
and PE0, with PE0 data in memory 623. The SP/PE0 controller 625 also is the controlling point
of the data that is sent over the 32-bit or 64-bit broadcast data bus 626. The other PEs 651, 653,
and 655 contain common physical data memory units 623', 623", and 623''' though the data
stored in them is generally different as required by the local processing done on each PE. The
interface to these PE data memories is also a common design in PEs 1, 2, and 3 and controlled by
PE local memory and data bus interface logic 657, 657' and 657". Interconnecting the PEs for
data transfer communications is the cluster switch 671, various aspects of which are described in
greater detail in U.S. Patent Nos. 6,023,753, 6,167,501, and 6,167,502. The interface to a host
processor, other peripheral devices, and/or external memory can be done in many ways. For
completeness, a primary interface mechanism is contained in a direct memory access (DMA)
control unit 681 that provides a scalable ManArray data bus 683 that connects to devices and
interface units external to the ManArray core. The DMA control unit 681 provides the data flow
and bus arbitration mechanisms needed for these external devices to interface to the ManArray
core memories via the multiplexed bus interface represented by line 685. A high level view of a
ManArray control bus (MCB) 691 is also shown in Fig. 6.

Fig. 7A shows an example of a finite field multiply instruction (MPYGF) encoding 
format 700 for use in conjunction with the ManArray system 600 described above with 
appropriate hardware circuitry to perform the calculations described above in detail. Fig. 7B 
shows a syntax/operation table 750 for the instruction encoding format 700 of Fig. 7A. The
MPYGF instruction 700 calculates the remainder of the polynomial division of the product of
two polynomials with coefficients from Galois field GF[2]. Four GF multiplications can be
specified and calculated simultaneously as shown in syntax/operations description 752 and eight
GF multiplications can be specified and calculated simultaneously as shown in
syntax/operations description 754. The input polynomial coefficients are represented as bits in 4,
or 8, unsigned bytes in source 32-bit registers Rx and Ry or 64-bit register pairs Rxe-Rxo, and
pair Rye-Ryo, respectively. The results of the GF multiplication are stored in the corresponding
bytes of register Rt, or Rte-Rto, respectively. The arithmetic scalar flags (ASFs), representing 
possible side effects of the MPYGF operation, are affected only by the least significant sub- 
operation of a packed data operation. The C, N and V flags are not affected by the least 
significant sub-operation while the Z flag is set to a 1 if the least significant sub-operation is a 
zero. Otherwise the Z flag is 0. The MPYGF instruction is defined to take 1 execution cycle. 

A polynomial setup register (PSR), located in a register in a miscellaneous register file
(MRF) extension 1, is defined in the ManArray architecture to contain the generator polynomial
(PSR.B0) and degree (PSR.B1) of the finite field, with the polynomial coefficients set as bits of
byte PSR.B0. The generator polynomial must be loaded into the PSR using either a Load or a
DSU instruction. Note that m cannot exceed 8 for this instruction due to the present hardware
specification, but, as shown by Equation 7, there does not appear to be any algorithmic
limitation to handling arbitrary values of m.

By way of example, to calculate sixteen GF(2^5) multiplications of unsigned bytes stored
in the compute register file registers Rx.B0 times Ry.B0 in a four PE system such as shown in
Fig. 6, in the field generated by the generator polynomial g(x) = x^5 + x^2 + 1, a program first loads
the PSR byte 0 and byte 1 with the generator coefficients and m respectively, and then issues the
mpygf instruction, mpygf.sm.4ub Rt, Rx, Ry to all four PEs. When the instruction is executed,
four sets of m-bit results are stored in register file target register Rt in each PE. For example,
one of the calculations in one of the PEs that executes the mpygf.sm.4ub Rt, Rx, Ry instruction
could be Rx.B0=(11101) times Ry.B0=(10111) producing the result (11010) which is stored in
the register file target register Rt.B0 755. Note that three other mpygf operations 757 also occur
in parallel on this PE as specified by the quad operation mpygf instruction 700 and a total of
sixteen GF multiplications occur in parallel on all four PEs.
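
The packed operation on one PE can be emulated in software along the following lines. This is a sketch under stated assumptions, not a definition of the instruction: it assumes PSR.B0 carries the generator coefficients g_{m-1} ... g_0 with the x^m term implied, that operand bits above bit m-1 are ignored, and that byte 0 is the least significant byte of each register. The function names gf_mul and mpygf_4ub are hypothetical, and the arithmetic scalar flags are not modeled.

```python
def gf_mul(p: int, q: int, g_low: int, m: int) -> int:
    """Single stage GF(2^m) multiply on m-bit operands (m-bit form of Equation 7)."""
    mask = (1 << m) - 1
    p &= mask                                   # assumption: upper operand bits ignored
    q &= mask
    y = 0
    for i in range(1, m + 1):
        msb = (y >> (m - 1)) & 1
        q_bit = (q >> (m - i)) & 1
        y = ((y << 1) & mask) ^ (p if q_bit else 0) ^ (g_low if msb else 0)
    return y


def mpygf_4ub(rx: int, ry: int, psr_b0: int, psr_b1: int) -> int:
    """Emulate four byte-wise GF multiplies on 32-bit packed operands."""
    m = psr_b1                                  # degree of the field
    g_low = psr_b0 & ((1 << m) - 1)             # generator coefficients below x^m
    rt = 0
    for lane in range(4):
        a = (rx >> (8 * lane)) & 0xFF
        b = (ry >> (8 * lane)) & 0xFF
        rt |= gf_mul(a, b, g_low, m) << (8 * lane)
    return rt


# Byte 0: 11101 * 10111, byte 1: 11101 * 10010, byte 2: 1 * x, byte 3: 0 * 11111.
rt = mpygf_4ub(0x00011D1D, 0x1F021217, psr_b0=0x05, psr_b1=5)
print(hex(rt))  # 0x21c1a
```

Byte 0 of the result is 0x1A, the pattern (11010) of the example above. The four lanes are independent, which is what allows them to map onto separate GF multiplication units operating in the same cycle.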

A program or programs that emulate a GF multiplication or use a GF multiplication
instruction based on the principles of the present invention can be stored in an electronic form
on a computer useable medium which can include diskettes, CD-ROM, DVD-ROM, storage on a
hard drive, storage in a memory device using random access memory, flash memory, Read Only
Memory or the like, in downloadable form for downloading through an electronic transport
medium, and the like.

While the present invention has been disclosed in the context of various aspects of 
presently preferred embodiments, it will be recognized that the invention may be suitably applied 
to other environments and applications consistent with the claims which follow.